Introduction
Neighborhoods with a high queer population, or Gayborhoods, have been urban areas in which LGBTQ+ people have found community and built culture worldwide. However, these geographic areas serve as much more than the sexuality of their constituents, and have been cited as yielding robust creative economies, as well as a welcoming environment for those of many identities. Knowing this, the identity of a certain locale as a Gayborhood becomes a crucial sociological metric, with neighborhoods with a more prevalent queer identity driving social liberalism in the face of prejudice. Our analysis thus focuses on building a location-based regression model that uses a variety of parameters including housing, land use, and non-queer demographic data to predict Gayborhood degree, and, in so doing, determine whether an area is suitable for queer folks with the goal of advancing understanding of liberal areas.
Workflow Overview
Our initial modeling process consisted of initial data collection (Section 2), exploratory analysis and feature engineering (?@sec-FE), and modeling with v-fold cross validation. This was followed by creation of an intelligent ensemble model and subsequent assessment of our ensemble on data that was withheld from our training set. We assessed models mostly based on root mean sqared error, or RMSE, as this is a common and sensical metric for regression. However, we also include the concordance correlation coefficient, or CCC, to assess the amount of variation in our outcome variable captured by our model. This process is represented below, in figure 1:
Diagram of the initial modeling process
Outcome Variable
In order to rank Gayborhood status, we are using a published dataset from data world1 that was analyzed to display distribution of queer communities in Jan Diehm’s 2018 article Men are from Chelsea, Women are from Park Slope. Specifically, the metrics in this dataset are combined to form a Gayborhood index, which we will be using a modified version of as the supervising variable for our machine learning project. This index is a holistic assessment of queerness, derived primarily from the following measures:
- Same-sex joint tax filers
- Unmarried partner same sex households
- Number of Gay Bars
- Whether or not a pride parade routes through the region
In this case, we will be training a regression model to predict Gayborhood index. The initial variable exists on a 0 to 100 scale with a mean of 2.39, median of 1.27, and a max of 67.1. In Section 3, we will discuss our measures to reprocess the outcome variable.
Data collection
Because this project is based primarily on data by ZCTA, a US Census-defined zipcode proxy, we are able to pool data from two main sources – the US Census and Open ICPSR– to encompass 4 major facets of urban life.
- Our first predictor set2 comes from the US Census, and provides a variety of information about housing characteristics including number of housing units in an area, rent prices, and indicators of development (phone reception, income to rent ratio, etc.). This is derived from the Census’s ACS 5-year survey estimates for the year 2015. We use data from 2015 since that is also when the Gayborhood dataset was published.
- We also used US Census data to glean information about various demographic characteristics, as represented in our second data section3. This set primarily synthesizes demographic information that is not specifically related to sexuality (i.e. the parameters that went into the calculation of our outcome variable), but describe other characteristics of each area.
- We also opted to include data from Open ICPSR, which included information about parks4 in each geographical region, including number of open parks and proportion of ZCTA area that is park space. Since parks are public areas, they serve as a predictive metric for the culture of a community, and are thus important in describing neighborhood lifestyle.
- Finally, we included data concerning the primary method of transport to work5, also from the US Census ACS survey. This will round out our predictors by describing the movement, as well as the physical locale, within a neighborhood.
These yield a functional dataset with \(> 550\) predictors, which fall into the above categories. A skim of the first 10 variables is provided below:
| Name | Piped data |
| Number of rows | 2145 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| zip_code | 0 | 1 | FALSE | 2145 | 017: 1, 017: 1, 017: 1, 017: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| gayborhood_index | 0 | 1 | 1.09 | 0.50 | 0 | 0.77 | 1.08 | 1.41 | 2.6 |
| estimate_sex_and_age_total_population | 0 | 1 | 29847.58 | 19524.92 | 304 | 15415.00 | 26827.00 | 39609.00 | 114982.0 |
| estimate_sex_and_age_total_population_male | 0 | 1 | 14550.75 | 9507.60 | 194 | 7460.00 | 13001.00 | 19268.00 | 60595.0 |
| estimate_sex_and_age_total_population_female | 0 | 1 | 15296.83 | 10078.78 | 110 | 7847.00 | 13772.00 | 20323.00 | 58520.0 |
| estimate_sex_and_age_under_5_years | 0 | 1 | 1924.05 | 1532.47 | 0 | 851.00 | 1584.00 | 2553.00 | 12098.0 |
| estimate_sex_and_age_5_to_9_years | 0 | 1 | 1865.10 | 1456.03 | 0 | 832.00 | 1585.00 | 2497.00 | 10215.0 |
| estimate_sex_and_age_10_to_14_years | 0 | 1 | 1841.39 | 1432.86 | 0 | 810.00 | 1544.00 | 2459.00 | 9785.0 |
| estimate_sex_and_age_15_to_19_years | 0 | 1 | 1880.63 | 1506.29 | 0 | 800.00 | 1527.00 | 2503.00 | 10857.0 |
| estimate_sex_and_age_20_to_24_years | 0 | 1 | 2105.32 | 1824.86 | 0 | 851.00 | 1667.00 | 2829.00 | 18938.0 |
| estimate_sex_and_age_25_to_34_years | 0 | 1 | 4717.06 | 3706.06 | 9 | 2024.00 | 3964.00 | 6408.00 | 26926.0 |
| estimate_sex_and_age_35_to_44_years | 0 | 1 | 4199.07 | 2902.13 | 0 | 2040.00 | 3730.00 | 5609.00 | 18739.0 |
Issues and Challenges
Collecting and cleaning the data was fairly easy. The census data is well organized, and generally has low missingness, which allowed us to only throw out a few potential predictors and will allow us to impute the rest. For cleaning of census data, we also removed columns that contained margins of error and annotations rather than the actual variables. The parks data required the removal of some text form within the data columns and conversion to numeric format. After cleaning individually, all datasets were then combined into a single dataset.
One main concern with our data is the ratio of observations to predictors, since the ratio is currently ~ 1:4. Because of this, most of our feature engineering will rely on targeted dimensionality reduction in order to explain the variance captured in these variables without overfitting to prediction variables.
Because census predictors were numerous, any variables with greater than 10% missing were removed. The missingness in any remaining variables is assumed to be random since these parameters were collected and reported in the US Census ACS survey.
Exploratory Data Analysis
Outcome Variable Transformation
To begin our exploratory analysis, we assessed methods for transformation of our outcome variable, an index measuring how queer a ZCTA region was based on parameters defined in this dataset from data world6. Based on the visualization below, a transformation should be applied to the outcome variable as it has a right skew, which will hinder the modeling process. Additionally, notice that the boxplot denotes many outliers within the data, which will lead to poor fits for models that can only generalize to common data. This report hopes to reframe this index so our measure will appear to have data generated from a normal distribution.
We originally planned to apply a log10 transformation to our outcome variable. However, the appearance of success from the log transformation was based off removal of 0 values, which are undefined (mapped to negative infinity). Examining the original data more closely, we find the following observations:
- Calculations for
totindex(the index measure) are not easily decipherable, and for a few ZCTAs, seem to be wrong entirely - There exist 0 values when some of the key queerness measures (e.g. number of same sex couples joint filing taxes) are positive
- Some zipcodes that are included in the data have no documented inhabitants (like 20535 and 94128)
In attempts to remedy these issues, we first dropped any zipcodes with \(<100\) inhabitants, and then recalculated our outcome variable, based on the following formula:
\[gayborhood\_index = 40*ss\_tax\_index + 40*ss\_live\_index + 10*parade + 10*bars\] where
- \(ss\_tax\_index\) is a measure of what percent of couples filing jointly are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
- \(ss\_live\_index\) is a measure of what percent of unmarried couples living together are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
- \(parade\) is an indicator variable indicating if a pride parade passes through that zipcode (0 or 1)
- \(bars\) is a measure of how many gay bars there are in that area (normalized by the maximum value, 20, on a 0 to 1 scale)
This calculation accurately repurposes the original dataset for our use, which is not as concerned with queer identity across gender identities as was the original article Men are from Chelsea, Women are from Park Slope.
We then used a Yeo-Johnson transformation with \(\lambda = -0.268\) (optimized by the applicable step_YeoJohnson() in tidymodels) to transform our data and improve normality. This transformation was chosen as it allows us to keep our 0 values, which have an important physical representation in this dataset, without greatly compromising normality:
Now, although the the transformed outcome variable still does not appear to have been generated from a normal distribution, there are no outliers, which will improve model performance.
Predictor Variable Selection
The provided data listed over 500 possible predictor variables. Fitting models to all potential predictor variables would not only be computationally expensive, but increase the variance of the predictions. Because of these issues, this report aims to mechanically select the most relevant predictors for the regression problem.
A LASSO model was fit to the training set to further narrow down the list of predictors by penalizing the addition of terms to a regression model. A lenient penalty of 0.01 was employed, along with steps to impute missing values with K-Nearest Neighbors and remove variables with near zero variance, which would have provided little help in the prediction problem. Ultimately, about 100 predictors remained to be used in the model fitting process.
Feature Engineering
In addition to the issue of having many possible predictor variables, there are varying degrees of missingness across the predictors. Many of the variables with larger degrees of missingness were removed in the process of lasso selection, and the remaining variables had low missingness (under 2%). The remaining missing values are summarized in the table below, with higher missingness listed first.
The missingness for the rest of the models was also resolved by imputing the variables with values from a K-Nearest Neighbors model.
Our model was based relationships between the Gayborhood index and all of the remaining predictors after lasso selection. Additional feature engineering steps include the following:
- removal of all near zero variance predictors (though this is ideally redundant after lasso selection)
- updating the role of the ID column to ensure it is not used as a predictor,
- dummy encoding categorical predictors,
- normalization of all predictors (centered and scaled to unit variance)
This feature engineering workflow was decided on after initial tinkering with the elastic net model as a base for understanding our data. In order to increase the amount of variation explained by this model, we included a version that included all pairwise interactions between the feature selected terms, and then proceeded to use partial least squares tuned to the outcome to reduce dimensionality. Number of components used in this model was tuned, yet it still performed worse than the base elastic net model, likely because the penalty term proxied for targeted dimensionality reduction.
Data Splitting and Folding
We split the data into training and testing sets with a proportion of 80% training and 20% testing, stratified by the outcome variable with 4 bins. We chose this as it is the convention for datasets that are large enough that withholding data from the training set is not a concern.
We used v-fold cross validation to tune hyperparameters while preventing overfitting to the training data. We used 5 folds with 3 repeats as opposed to larger numbers to reduce model runtimes.
Modeling Process
Model Types
We fit the following model types:
- Penalized Elastic Net Regression
- with and without partial least squares
glmnetengine
- Random Forest
rangerengine
- Boosted Tree
xgboostengine
- K-Nearest Neighbors
kknnengine
- Neural Networks
- Using
kerasand Tensorflow from python through R
- Using
- Multivariate Adaptive Regression Splines (MARS)
earthengine
- Cubist ensemble regression
cubistengine
Model Results
We found that the Cubist ensemble performed the best, with a mean RMSE of .328. Elastic net regression (without PLS) and random forest were also in the top 3 models, all with RMSEs under .34. K-nearest neighbors and boosteed performed almost as well, both under .35. There is then a significant gap to the worse models, those being elastic net with PLS, multivariate adapative regression splines, and neural networks with keras.
| Initial Model Comparison | |||
| Model Type | Best Mean RMSE | Standard Error | Optimal Hyperparameters |
|---|---|---|---|
| Cubist Ensemble Regression (BI) | 0.328 | 0.004 | Committees = 88, Max Rules = 290, Neighbors = 0 |
| Elastic Net (BI) | 0.337 | 0.004 | Penalty = 1.1e-08, Mixture = 0.051 |
| Random Forest (BI) | 0.339 | 0.003 | Mtry = 91, Trees = 1900, Min N = 3 |
| K-Nearest Neighbors (BI) | 0.343 | 0.003 | Neighbors = 22, Dist Power = 1 |
| Boosted Trees (BI) | 0.348 | 0.003 | Min N = 38, Learn Rate = 0.25, Loss Reduction = 1.2e-09 |
| Elastic Net (PLS) (BI) | 0.370 | 0.003 | Penalty = 0.0032, Mixture = 1 |
| Multivariate Adaptive Regression Splines (BI) | 0.376 | 0.003 | Num Terms = 5, Prod Degree = 1 |
| Neural Network (Keras) | 0.398 | 0.004 | Penalty = 1, Hidden Units = 5 |
| Null Model | 0.503 | 0.002 | NA |
Ensemble Model
Our three best model types, cubist ensemble regression, elastic net regression, and random forest, were combined in an ensemble model with stacks. The final ensemble kept 6 configurations of cubist ensemble, 2 configurations of elastic net, and 2 configurations of random forest. Here is a table of the individual model performances:
| Ensemble Results | |
| Model Name | RMSE |
|---|---|
| cer_bayes_1_23 | 3.196 |
| cer_bayes_1_24 | 3.216 |
| ensemble | 3.304 |
| cer_bayesIter9 | 3.318 |
| cer_bayes_1_09 | 3.376 |
| cer_bayes_1_06 | 3.378 |
| cer_bayes_1_05 | 3.414 |
| en_bayes_1_1 | 3.466 |
| en_bayesIter3 | 3.466 |
| rf_bayesIter1 | 3.673 |
| rf_bayes_1_3 | 4.326 |
Final Results
The RMSE of the final ensemble model was 3.30. Interestingly, this is higher than two of the ensemble members (both cubist ensemble models). The concordance correlation coefficient (CCC) of the final ensemble was .714.
Here is a plot of predicted vs true values of the Gayborhood index:
happy pride <3 such a well timed project tbh
Future Directions
we can say future research would include more recipe steps to test like splines (natural, b, whatever other step_spline stuff there is), PCA, etc.
we might actually want to try baguette instead, i feel like it slays, but i didnt do the reading so i owuldnt know.
Footnotes
Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎
US Census. (2021). DP04: SELECTED HOUSING CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=rent&g=010XX00US$8600000&tid=ACSDP5Y2021.DP04. Accessed 24 April 2023.↩︎
US Census. (2015). DP05: ACS DEMOGRAPHIC AND HOUSING ESTIMATES [Data set]. data.census.gov. https://data.census.gov/table?g=010XX00US$8600000&tid=ACSDP5Y2015.DP05. Accessed 26 April 2023.↩︎
Li, Mao, Melendez, Robert, Khan, Anam, Gomez-Lopez, Iris, Clarke, Philippa, and Chenoweth, Megan. (2020). National Neighborhood Data Archive (NaNDA): Parks by ZIP Code Tabulation Area, United States, 2018. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. https://doi.org/10.3886/E119803V1. Accessed 23 April 2023↩︎
US Census. (2015). S0802: MEANS OF TRANSPORTATION TO WORK BY SELECTED CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=commuting&g=010XX00US$8600000&tid=ACSST5Y2021.S0802. Accessed 30 April 2023.↩︎
Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎